1 Executive Summary

This DURIP equipment award was used to purchase, install, and bring on-line two Berkeley
Emulation Engines (BEEs) and two mini-BEE machines to establish an FPGA-based
high-performance multiagent training platform and its associated software. The acquisition
of the BEE4-W (Berkeley Emulation Engine) hardware platforms cost $201,500.00 in total.
To accelerate both the multiagent software simulation and hardware development, a Dell
integrated cluster environment unit was purchased, featuring 192 2.8 GHz processors (on
24 nodes) with 384 GB of high-speed RAM on an InfiniBand backbone. We also installed
cluster control software and licenses for the Ubuntu-based Linux operating system.
Installed software includes Java and C/C++ compiler suites (debuggers, performance
measurement tools, etc.), the Xilinx ISE software suite, and MATLAB. The system
was ordered in late 2012, arrived in early 2013, and was running routinely with
local users within a few weeks. The delay in equipment arrival was largely because all
BEEcube systems are highly customized hardware. Numerous research groups at UCF have
made extensive use of the DURIP cluster, running several different versions of the Xilinx ISE
design suite. This DURIP award has greatly reduced turnaround time and increased
productivity for these groups.

Specifically, the purchased equipment has enhanced ongoing DARPA- and ARO-funded
activities. Several exploratory research tasks using the new equipment are currently
under way at UCF to 1) alleviate the computing performance bottleneck imposed by
software-based simulation while conducting multiagent learning research, 2) significantly
increase the multiagent HyperNEAT learning intensity to facilitate solving real-world
problems, 3) enable training teams at much larger sizes than previously possible in
simulation, and 4) make training adaptive neural networks with genuine synaptic plasticity
feasible. In the long term, this equipment acquisition can expand the research capability
at the University of Central Florida (UCF) in the areas of robotics, artificial intelligence
(AI), and evolvable hardware by forging a close collaboration between two research teams at
UCF: one focusing on multiagent robot training (based on HyperNEAT technology and a hive
brain) and the other on high-performance reconfigurable computing (BCM, MARC, and
evolvable hardware).

To date, we have built the proposed hardware platform largely on the existing
technology of our Evolutionary Complexity Research Group (EPlex) and High-Performance
Reconfigurable Computing Laboratory. Currently, we are focusing on developing a
probabilistic computing engine for high-performance convolutions. This is motivated by the
fact that both convolution and correlation are fundamental building blocks in scalable
multiagent learning and in training a multiagent hive brain. In addition, this equipment
acquisition has enabled us, for the first time, to set up a practical educational platform
at UCF. This semester, we have experimented with using the purchased platforms in a
hardware/software co-design class. In the long term, we believe that the outcomes of this
equipment acquisition can benefit various tasks relevant to DoD missions in the field, such
as dynamic cordoning, coordinated surveillance, and multi-robot teleoperation.

Key Words: Hardware-Assisted; FPGA; BEE; Hive Brain; Multiagent.

2 Research Enhanced


• DARPA CSSG Phase 3: Real-world Scalable Multiagent Learning for Coordinated UGV
Operations; Period: 8/11 - 8/12; Amount: $249,347; PI: Kenneth O. Stanley; Program
Manager: James Donlon.

• ARO Network Sciences Division: Training a Multiagent Hive Brain for Coordinated
UGV Operations; Period: 9/11 - 9/14; Amount: $150,000 in first year; PI: Kenneth O.
Stanley; Program Manager: Purush Iyer.


3 Description of Purchased Equipment


We have purchased the following equipment as proposed in our DURIP project. Table 1
lists short descriptions and price information.


No.  Item               Description                                    Qty       Unit Price  Total
1    BEE4-LX550T-2C-CM  BEE4 LX550T Configuration for University Only  2         $100,750    $201,500
2    BEE4-DRAM-64G-CM   64GB DDR3-1066 x4 RDIMMs for BEE4 system       included
3    BEE4-PCIE-LP-CM    PCI Express Kit w/ low-profile bracket         included
4    BEE4-WTY-1Y-CM     1 year limited hardware warranty               included
Total                                                                                        $201,500.00

Table 1: Equipment list.


BEE (Berkeley Emulation Engine) Platform


Figure 1: Interconnect Network Architecture of a BEE platform.

We purchased the Berkeley Emulation Engine (BEE4) platforms to establish an FPGA-based
high-performance multiagent training platform. The BEE4 is a commercial, stackable,
full-speed multi-FPGA prototyping platform, integrated with DAC/ADC modules for
mixed-signal and digital communications designs. Specifically, each BEE4 system or module
consists of four Xilinx Virtex-6 FPGAs based on a unique HPC (High Performance Computing)
Honeycomb architecture, which allows us to build application-specific computing platforms
that replace a large number of computers. Each BEE4 module supports a variety of high-end
Virtex-6 FPGAs, including LXT 240/365/550 and SXT 315/475, allowing BEE4 to support
20 MGate designs per module, or 400 MGates per rack, all of which are interconnected
through high-speed links. This FPGA-based computing platform can 1) significantly increase
the number of evaluations possible in multiagent HyperNEAT learning to facilitate solving
real-world problems, 2) enable training teams at much larger sizes than previously possible
in simulation, and 3) make adaptive neural networks through synaptic plasticity feasible.
All these benefits can alleviate the computing performance bottleneck imposed by
software-based simulation while conducting multiagent learning research. In addition, BEE4
modules can be stacked or clustered to increase capacity without losing speed, which is
critical as we scale up the size of our multiagent robot training system. Finally, because
of the versatility of the BEE4 platform, this facility can serve the research and teaching
needs of several faculty members in our department and enhance our research in these areas.


4 Examples of Usage

The two sponsored projects, the DARPA project Real-world Scalable Multiagent Learning for
Coordinated UGV Operations (Period: 8/11 - 8/12; Amount: $249,347; PI: Kenneth O. Stanley;
Program Manager: James Donlon) and the ARO project Training a Multiagent Hive Brain for
Coordinated UGV Operations (Period: 9/11 - 9/14; Amount: $150,000 in first year; PI: Kenneth
O. Stanley; Program Manager: Purush Iyer), have made extensive use of the computational
cluster. (These projects are given highest-priority usage.) When the hardware emulation
engines constructed with the BEE4 platforms are finished, these two projects will use the
hardware platforms extensively.

4.1 Main Task I: High-Throughput FPGA-based Multiagent Training Platform

We are currently using the reconfigurable hardware within the purchased BEE4 platforms
to construct a high-throughput FPGA-based multiagent training platform capable of
evaluating probabilistic networks with arbitrary DAG (directed acyclic graph) topology. We
have finished building the application-specific processor cores that are specifically
tailored for multiagent training. We are now constructing an application-specific memory
access network that significantly improves memory bandwidth by avoiding the “von Neumann
bottleneck”. Eventually, we plan to replace the networks between clustered computers with
high-speed inter-FPGA links offering 200 times more bandwidth.
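
As a software reference for what evaluating a network with arbitrary DAG topology involves, the following minimal Python sketch orders the nodes topologically (Kahn's algorithm) and evaluates each node only after all of its inputs. It is an illustration, not the hardware design: the function name `evaluate_dag`, the `(src, dst, weight)` edge format, and the tanh activation are assumptions made for this example.

```python
import math
from collections import deque


def evaluate_dag(num_nodes, edges, inputs):
    """Evaluate a feed-forward network whose topology is an arbitrary DAG.

    edges  : list of (src, dst, weight) tuples (hypothetical format)
    inputs : dict mapping input-node index -> fixed activation value
    Returns a list holding the activation of every node.
    """
    incoming = [[] for _ in range(num_nodes)]
    outgoing = [[] for _ in range(num_nodes)]
    indegree = [0] * num_nodes
    for src, dst, weight in edges:
        incoming[dst].append((src, weight))
        outgoing[src].append(dst)
        indegree[dst] += 1

    # Kahn's algorithm: a node becomes ready once all predecessors are done.
    ready = deque(n for n in range(num_nodes) if indegree[n] == 0)
    value = [0.0] * num_nodes
    done = 0
    while ready:
        n = ready.popleft()
        if n in inputs:
            value[n] = inputs[n]
        else:
            # Assumed node function: tanh of the weighted input sum.
            value[n] = math.tanh(sum(value[s] * w for s, w in incoming[n]))
        done += 1
        for d in outgoing[n]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)

    if done != num_nodes:
        raise ValueError("graph contains a cycle")
    return value
```

Nodes that become ready at the same time have no mutual dependencies, so in principle they could be evaluated in parallel on separate processing nodes.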

The above scheme builds on our previous work on a large-scale Bayesian Computing Machine
using reconfigurable logic, in which we successfully developed a high-throughput Bayesian
computing machine (BCM) that not only can be applied to many important algorithms in
artificial intelligence, signal processing, and digital communications, but also has high
reusability: a new application does not require changing the BCM’s hardware design; only
new task graph processing and code compilation are necessary. In this DURIP effort, we are
currently working on making the computing architecture of the BCM scale linearly with the
size of the FPGA on which it is implemented. Because of our success with a Bayesian
computing machine containing 16 processing nodes in a Virtex-5 FPGA (XC5VLX155T-2), and
especially because of the strong similarity in computing pattern between evolving neural
networks and Bayesian network computing, we are quite optimistic that the proposed
multiagent training platform can achieve a performance improvement of two orders of
magnitude, similar to that of the BCM, against a 2.4 GHz Intel Core 2 Duo processor and a
GeForce 9400M using the CUDA software package. In addition, we expect these performance
improvements to be obtained with only tens of watts, while the reference GPU platform
consumed more than 200 watts.

The architecture of our hardware-assisted multiagent platform is highly optimized for
training large-scale neural networks. More specifically:

• Each processing node implements only the minimum hardware required for the necessary
neural network computation. Unlike conventional parallel processors implemented with
ASIC technology, where the number of pipeline stages seldom exceeds 10 due to data
dependencies, we pipeline each processing node of the proposed multiagent training
platform to 20-30 stages in order to maximize throughput. Furthermore, instead of
building sophisticated control circuitry to handle data or instruction hazards, the
multiagent training platform avoids dependencies completely by pre-processing the
computing task graph and strategically allocating memory for intermediate results during
execution.

• The memory of the multiagent neural substrates is physically distributed in order to
take advantage of the massively distributed memory blocks in a modern FPGA.
Communication between processors and memories is handled by two separate crossbar
switches. Combined with pre-processing of the computing task graph in the compilation
phase, our novel memory allocation scheme can effectively avoid any data dependency
and memory hazards.

• Like each processing node, the scheduler that controls the two crossbars is deeply
pipelined to improve throughput. Because all information needed for memory allocation
of outgoing messages is available before the results themselves, we can start
computing the schedule right after the first stage of the processor.
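
The hazard-avoidance idea in the list above can be illustrated in software. The Python sketch below is a hypothetical list scheduler, not the authors' compiler: it assigns every operation of a task graph an issue cycle on one of `num_nodes` pipelined processing nodes such that no operation issues before each of its operands has left a pipeline of `depth` stages. With such a schedule fixed at compile time, no interlock or forwarding logic is needed at run time. The names `static_schedule`, `deps`, `num_nodes`, and `depth` are assumptions made for this example.

```python
def static_schedule(deps, num_nodes, depth):
    """Assign (issue_cycle, node_index) slots to the operations of a task graph.

    deps      : dict op -> list of ops whose results it consumes (must be a DAG)
    num_nodes : number of identical pipelined processing nodes
    depth     : pipeline depth; a result is usable `depth` cycles after issue
    """
    # Order the operations so that every producer precedes its consumers.
    order, seen = [], set()

    def visit(op, stack=()):
        if op in stack:
            raise ValueError("task graph contains a cycle")
        if op in seen:
            return
        for producer in deps.get(op, []):
            visit(producer, stack + (op,))
        seen.add(op)
        order.append(op)

    for op in deps:
        visit(op)

    slots_used = {}  # cycle -> number of nodes already issuing in that cycle
    schedule = {}
    for op in order:
        # Earliest cycle at which all operands have drained from the pipeline.
        cycle = max((schedule[p][0] + depth for p in deps.get(op, [])), default=0)
        # Each node accepts one new operation per cycle; skip full cycles.
        while slots_used.get(cycle, 0) >= num_nodes:
            cycle += 1
        schedule[op] = (cycle, slots_used.get(cycle, 0))
        slots_used[cycle] = slots_used.get(cycle, 0) + 1
    return schedule
```

In this sketch, independent operations fill the issue slots between dependent ones, which is what keeps a 20-30 stage pipeline busy when the task graph offers enough parallelism.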
4.2 Main Task II: Energy-Efficient Multiplier-Less Discrete Convolver through Probabilistic Domain Transformation

Large-scale convolution is a fundamental building block for all of the multiagent learning
algorithms and many important video and image processing algorithms. Using the purchased
BEE4 platform, we have designed a high-performance reconfigurable discrete convolver
intended for FPGA-based multiagent training processors. While the conventional
multiplier-based architecture can only achieve O(N^2), our proposed architecture can
achieve approximately O(N) in algorithmic complexity and is therefore highly scalable and
energy efficient. This significant reduction in algorithmic complexity is made possible by
using probabilistic domain transformation and several novel reconfigurable hardware
techniques to convert computationally intensive multiplications into additions that
manipulate random signal samples. In addition, these techniques make the proposed
architecture highly fault-tolerant because the information to be processed is encoded as a
probability density function instead of in binary form. As such, local perturbations of
computing accuracy or signal values are inconsequential to the overall results.
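
The report does not spell out the transformation itself. One standard identity that fits the description is that the distribution of the sum of two independent random variables is the convolution of their distributions. The Python sketch below is an illustration under that assumption and not the authors' hardware algorithm: it treats two non-negative sequences as unnormalized distributions, draws index samples from each, and histograms the sums, so the inner loop contains only additions. The function names and the sample count are hypothetical, and signed inputs are not handled.

```python
import random


def sampled_convolution(f, g, num_samples=200_000, seed=0):
    """Estimate the discrete convolution of two non-negative sequences
    by adding random index samples instead of multiplying values."""
    rng = random.Random(seed)
    # Draw indices with probability proportional to the sequence values.
    xs = rng.choices(range(len(f)), weights=f, k=num_samples)
    ys = rng.choices(range(len(g)), weights=g, k=num_samples)
    counts = [0] * (len(f) + len(g) - 1)
    for x, y in zip(xs, ys):
        counts[x + y] += 1  # the sum of two samples selects the output bin
    # Undo the normalization once, outside the sampling loop.
    scale = sum(f) * sum(g) / num_samples
    return [c * scale for c in counts]


def direct_convolution(f, g):
    """Reference multiplier-based convolution with len(f) * len(g) products."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out
```

Comparing `sampled_convolution([1, 2, 3, 2], [1, 1, 2])` with `direct_convolution` on the same inputs shows the estimate converging as the number of samples grows. Because each sample contributes only a small increment to one bin, a few corrupted samples change the result very little, which is consistent with the fault-tolerance argument above.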

To validate the proposed probabilistic architecture for the discrete convolver, we have
developed several prototypes with Virtex-6 FPGA devices (XC6VLX550T). To further
optimize the performance of our probabilistic convolver, we developed four novel hardware
techniques (Segmented Memory Swapping, Stochastic Mixing Scheme, Virtual Indexing
Scheme, and Scalable Adder Extension) to minimize hardware usage and mitigate the memory
bandwidth bottleneck. Our prototype probabilistic convolver requires just 4.09 µs to
perform a 128 x 128 convolution and dissipates only 166.63 nJ of dynamic energy at
250 MHz. We believe that this new architecture can be exploited in any real-time
application that requires energy-efficient convolutions and that it can be realized with
many other FPGA device families.
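
The reported prototype figures can be put in perspective with a short calculation. The Python sketch below takes the latency as 4.09 microseconds, which is an assumption about the intended unit (a picosecond-scale figure would be far shorter than one 4 ns clock period at 250 MHz), and derives the cycle count and average dynamic power. The variable names are hypothetical.

```python
latency_s = 4.09e-6    # assumed: 4.09 microseconds per 128 x 128 convolution
energy_j = 166.63e-9   # reported dynamic energy per convolution
clock_hz = 250e6       # reported clock frequency

cycles = latency_s * clock_hz        # clock cycles per convolution
avg_power_w = energy_j / latency_s   # average dynamic power while running
rate_per_s = 1.0 / latency_s         # back-to-back convolutions per second

print(f"{cycles:.1f} cycles, {avg_power_w * 1e3:.2f} mW, {rate_per_s:,.0f} conv/s")
```

Under that assumption, one convolution takes roughly a thousand clock cycles at an average dynamic power of about 41 mW.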

5 Summary

The DURIP project officially started on August 15, 2012. All equipment was purchased
within one month. However, because all FPGA platforms are custom-built, the final arrival
of the BEE4 platforms was delayed to May 2013. Nevertheless, the FPGA-based
high-performance multiagent training platform will be used extensively for at least five
years, after which it will be upgraded.

The purchased platforms are housed at the Computer Architecture Laboratory (Harris
Engineering Center Room 242, with additional resources in Room 238) in the School of
Electrical Engineering and Computer Science at the University of Central Florida. The
MARC training of defense-related engineering students has taken place through the lab
sessions of course EEL5722c (Advanced FPGA Design and Technology).

We have included the following as “Synergistic Benefits to DoD”:

• Research

– We developed FPGA-based high-performance computing systems that alleviate
the computing performance bottleneck imposed by software-based simulation
while conducting multiagent learning research.

– This system can significantly increase the multiagent HyperNEAT learning
intensity to facilitate solving real-world problems.

– Through this purchase, we have explored how to enable training teams at much
larger sizes than previously possible in simulation, and how to make training
adaptive neural networks with genuine synaptic plasticity feasible.

• Education (Both Graduate and Undergraduate)

– For the first time, we set up a practical educational platform at UCF for many of
our students who are currently working (or will work) for the defense-related industry.

– This DURIP project provides valuable opportunities for both graduate and
undergraduate students to learn real-world hardware design.

All these outcomes of the equipment acquisition can benefit various tasks
relevant to DoD missions in the field, such as dynamic cordoning, coordinated surveillance,
and multi-robot teleoperation. Some of these benefits have already been realized.